Species Distribution Modeling: Blue Oak (Quercus douglasii)

R Spatial Analysis Species Distribution Modeling MaxEnt

An intro to species distribution modeling for an iconic California woodland species.

Steven Cognac true
2022-08-04

cover image courtsey of California Academy of Sciences, 2016

In this post I’ll go through an introduction to species distribution modeling (SDM) using the dismo package. These is machine learning (algorithm) based approach to SDM. I’ll walk through the steps of acquiring species occurrence data, gathering underlying climate data, making MaxEnt predictions, and calibrating our model.

We’ll use an SDM to make inferences about the distribution of suitable habitat for a species of interest. There are many SDM approaches out there all with their own pros and cons. Check out this table comparing different SDM options. While there are many limitations of SDMs, their ability to gain ecological insight and predict species distribution across a landscape is valuable.

For this exercise we’ll use the blue oak (Quercus douglasii) tree. Blue oak trees are native and and endemic to California. They occur with a narrow range along the valleys and slopes of the mountain ranges that surround the Central Valley in California.

Quercus douglasii; Blue Oak. 2018 Gary McDonald

The steps we’ll follow to for out SDM are as follows:

Installing Packages

If you are having problems installing the rJava package like so:

Here are some troubleshooting options for installation on a macOS Monterey with an M1 architecture (ymmv).

Here are the steps I took to get rJava working:

  1. Uninstall rJava with remove.packages().
  2. Download Java 8 (LTS) x86_64-bit JDK for MacOS from Azul Core Platform .
  3. Run R CMD javareconf in the terminal. This reconfigures JAVA_HOME directories.
  4. Re-install rJava in Rstudio and voila.

Exploring Species Data

We’ll set up our file structure here.

Let’s download our species data. We’ll limit it to the Global Biodiversity Information Facility (GBIF) data library using the spocc::occ function. Other data repositories are available (iNaturalist, eBird, Bison, iDigBio), but a lot of that data eventually ends up in GBIF. We’ll filter our results for CA only observations (native range) and remove “PRESERVED_SPECIMEN” which are Herbaria records. in CA To do this we’ll We then take the obs_csv dataframe and covert it to a spatial object using our coordinate data.

The function file.exists() returns a logical vector. I’ll use that notation to avoid re-downloading/re-creating files each time the document is compiled.

Selecting Environmental Data

We’ll now get the underlying climate data for our SDM.

We’ll use the two terrestrial datasets, “WorldClim” and ENVIREM”.

There are 86 layers to choose from. Not all of them are needed so let’s select a few layers that are most relevant. We’ll use:

Layer Code name description
WC_alt Altitude Altitude
WC_bio1 Annual mean temp Annual mean temp
WC_bio2 Mean diurnal temp range Mean of the monthly (max temp - min temp)
WC_bio6 Minimum temp Minimum temp of the coldest month
WC_bio12 Annual precipitation Annual precipitation
ER_tri Terrain roughness index Terrain roughness index
ER_topoWet Topographic wetness SAGA-GIS topographic wetness index

These raster layers are currently global. Let’s crop the environmental rasters to a reasonable study area. There are two good options to clip our rasters:

  1. create a convex hull around our species observations using the sf package (sf::st_convex_hull(st_union()))
  2. use the boundary of California. We’ll use this second option grabbing boundaries from the USAboundaries package

Now let’s clip our environmental layers to this bounding box.

Extract Raster values from Observations

This table above is the **data* that feeds into our SDM (y ~ X), where: y is the present column with values of 1 (present) and X is all other columns: WC_alt, WC_bio1, WC_bio2, WC_bio6, WC_bio12, ER_tri, ER_topoWet, lon, lat

MaxEnt (Maximum Entropy) Model

Maxent is probably the most commonly used species distribution model (Elith 2011) since it performs well with few input data points, only requires presence points (and samples background for comparison) and is easy to use with a Java graphical user interface (GUI).

MaxEnt requires presence-only observation points.

This is MaxEnt version 3.4.3 

In our Variable Contribution plot we are checking out the percentage our environmental predictors contribute in our model. We see here that the top three predictors WC_bio12, WC_bio6, and WC_bio1 contribute approximately 90%.

Response Curves

Response curves show how each environmental variable affects the MaxEnt prediction and how that prediction changes as each variable is varied. The default in the dismo::response package is vary one variable while setting all other environmental variables to their mean. In other words, the curves show the marginal effect of changing exactly one variable.

Let’s make raw predictions of for species location

Generate Raw Predictions

Model Calibration

While these raw results are a good start, it’s important to calibrate your model. It is critical to check for multicollinearity between independent variables in your model. When you have multiple independent variables (in our case, our X values) that have high intercorrelation, this can lead to skewed or misleading results. This essentially widens your confidence intervals to produce less reliable results.

We first use a pairs plot to identify pairwise correlations between variables to identify pairs or groups of variables that are highly correlated. You can see below we have some strongly correlated variables. For our model, we’ll set our pairwise correlation threshold at 0.7.

We can detect multicollinearity (strong correlation between two or more independent variables) with a Variance Inflaction Factor (VIF) with the usdm::vif() function. A value with a VIF greater than 10 indicates that our model has a collinearity problem. We can reduce

We see that WC_alt, WC_bio1, and WC_bio6 all have VIF greater than 10. Let’s remove those from our raster stack.

3 variables from the 7 input variables have collinearity problem: 
 
ER_topoWet WC_bio6 WC_bio1 

After excluding the collinear variables, the linear correlation coefficients ranges between: 
min correlation ( WC_bio2 ~ WC_alt ):  -0.1005908 
max correlation ( ER_tri ~ WC_alt ):  0.6104043 

---------- VIFs of the remained variables -------- 
  Variables      VIF
1    WC_alt 1.639451
2   WC_bio2 1.156533
3  WC_bio12 1.474288
4    ER_tri 2.169617

Let’s re-compute our

Citation

For attribution, please cite this work as

Cognac (2022, Aug. 4). Steven Cognac: Species Distribution Modeling: Blue Oak (Quercus douglasii). Retrieved from https://github.com/cognack/2022-08-04-speciesdistmodeling/

BibTeX citation

@misc{cognac2022species,
  author = {Cognac, Steven},
  title = {Steven Cognac: Species Distribution Modeling: Blue Oak (Quercus douglasii)},
  url = {https://github.com/cognack/2022-08-04-speciesdistmodeling/},
  year = {2022}
}